R/Pharma 2025 Workshop
2025-11-07
Four parts of this workshop:
Python environment setup (Nan)
Use uv to create and manage reproducible Python projects. Develop and collaborate in GitHub Codespaces, Visual Studio Code, or Positron.
Python packages for clinical reporting (Yilong)
A guided tour of essential packages such as polars, plotnine, and rtflite, with demonstrations of creating TLFs commonly used in clinical trials.
Manage clinical trial A&R projects (Yilong)
Practical project structure, conventions, and execution from data to deliverables.
Prepare eCTD submission packages (Nan)
An example workflow for assembling submission-ready source code and outputs using py-pkglite, aligned with eCTD requirements.
The views and opinions expressed in this presentation are those of the individual presenters and do not represent those of their affiliated organizations or institutions.
With Python, learning how to:
Note
The toolchain, process, and formats may differ across organizations. We present only one common approach.
Note
Interested in R? Check https://r4csr.org/
R/Pharma organizers
Team members from Meta Platforms and Merck & Co., Inc., Rahway, NJ, USA
Contributors of pycsr and r4csr training materials
In this workshop, we assume you have basic Python programming experience and clinical development knowledge.
adsl, adae, etc.
Training material: https://pycsr.org/
During the workshop, we will use the pycsr project
We share the automation philosophy of the R community, described in Section 1.1 of the R Packages book and quoted here.
Three recommended options:
GitHub Codespaces
Positron
VS Code
uv is a modern Python package and project manager written in Rust.
Replaces scattered toolchain:
pip + venv + pyenv + pip-tools + setuptools
Benefits:
pyproject.toml as single source of truth
Skip if using GitHub Codespaces: uv is pre-installed there.
macOS/Linux:
Windows:
Verify:
Ruff - Code formatting and linting
mypy - Type checking
pytest - Testing framework
All configured in pyproject.toml.
Virtual environments are mandatory in Python
Dependency locking
uv.lock pins exact versions (comparable to renv.lock in R)
.python-version file pins the Python version (e.g., 3.13.9)
The ICH E3 guideline, Structure and Content of Clinical Study Reports, provides guidance to assist sponsors in developing a CSR.
In a CSR, most TLFs are located in:
Publicly available CDISC pilot study data are located in the CDISC GitHub repository.
The dataset structure follows the CDISC Analysis Data Model (ADaM).
Source data: https://github.com/elong0527/r4csr/tree/main/data-adam
Converted parquet data: https://github.com/nanxstats/pycsr/tree/main/data
polars: Python package for data manipulation similar to dplyr/tidyr R packages
rtflite: Python package for creating production-ready tables and figures in RTF format similar to R package r2rtf
Modern Python dataframe library designed for performance and expressiveness.
Key advantages:
Core operations:
Counting participants:
Calculating percentages:
Pivoting to wide format:
Fill nulls for categorical counts:
Use typed literals for schema consistency:
Count unique subjects (not events):
In the pharmaceutical industry, RTF and Microsoft Word formats play a central role in preparing clinical study reports
Different organizations can have different table standards
For example, table layout, font size, border type, footnote, data source
rtflite is a Python package to create production-ready tables and figures in RTF format.
rtflite is designed to:
Before creating an RTF table, we need to:
Figure out table layout.
Split the layout into small tasks in the form of a computer program.
Execute the program.
Three-step process:
rtflite package provides the flexibility to customize table appearance for
rtflite package also provides the flexibility to convert figures into RTF format.
rtflite focuses only on table formatting. Data manipulation and analysis should be handled by other Python packages.
RTFTitle: Main and subtitle lines
RTFColumnHeader: Define column structure
RTFBody: Table content formatting
Create hierarchical column headers:
rtf_column_header=[
    rtf.RTFColumnHeader(
        text=["", "Placebo", "Xanomeline Low Dose", "Xanomeline High Dose"],
        col_rel_width=[3] + [2] * 3,
    ),
    rtf.RTFColumnHeader(
        text=["", "n", "(%)", "n", "(%)", "n", "(%)"],
        col_rel_width=[3] + [1] * 6,
        border_top=[""] + ["single"] * 6,
        border_left=["single"] + ["single", ""] * 3,
    ),
]
Multiple tables in one document:
Conditional formatting:
Seven-step pattern seen across all examples:
Focus on one step at a time - break complex tables into manageable pieces.
Create functions for repeated operations:
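As an illustration, a repeated formatting step such as building "n (%)" cells can live in one small function. The helper name `fmt_n_pct` and its output format are hypothetical, not part of any package mentioned here.

```python
def fmt_n_pct(n: int, total: int, digits: int = 1) -> str:
    """Format a count and its percentage of total as 'n (xx.x)'."""
    if total == 0:
        # Avoid division by zero for empty groups
        return f"{n} (NA)"
    return f"{n} ({100 * n / total:.{digits}f})"


cells = [fmt_n_pct(n, 8) for n in (0, 2, 8)]
print(cells)  # ['0 (0.0)', '2 (25.0)', '8 (100.0)']
```

Placing such helpers in the study package (for example under `src/`) lets every table script reuse and test the same logic.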
Benefits:
Use indentation for subcategories:
Build tables row by row when needed:
Maintain sort order with Enum:
Use pandas bridge for statsmodels:
Key libraries:
statsmodels: Linear models, ANCOVA
scipy.stats: Statistical tests, distributions
Do:
Use .n_unique() to count unique subjects (not .len() on event data)
Use pl.lit(None, dtype=pl.Float64) for schema consistency
Avoid:
Nulls left in categorical counts (use .fill_null(0))
Key concepts:
.pivot() to reshape data to wide format
.fill_null(0)
Key concepts:
Key concepts:
Key concepts:
Key concepts:
.n_unique()
Key concepts:
.str.to_titlecase()
A Python package designed specifically to organize analysis scripts and code for a clinical trial project.
Purpose:
Combines:
demo-py-esub/
├── pyproject.toml # Project metadata
├── .python-version # Python version
├── uv.lock # Locked dependencies
├── src/demo001/ # Study-specific code
│ ├── __init__.py
│ └── utils.py
├── analysis/ # Quarto analysis docs
│ └── tlf-*.qmd
├── data/ # ADaM datasets
├── output/ # Generated TLFs
└── tests/ # Validation tests
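A layout like the one above can be checked automatically. The sketch below is a hypothetical helper (not part of any tool mentioned here) that reports which expected entries are missing, demonstrated on a throwaway skeleton that deliberately lacks `tests/`.

```python
import tempfile
from pathlib import Path

# Entries taken from the tree above; directories end with "/"
EXPECTED = [
    "pyproject.toml",
    ".python-version",
    "uv.lock",
    "src/demo001/__init__.py",
    "src/demo001/utils.py",
    "analysis/",
    "data/",
    "output/",
    "tests/",
]


def missing_entries(root: Path) -> list[str]:
    """Return the expected entries that are absent under root."""
    missing = []
    for entry in EXPECTED:
        path = root / entry
        found = path.is_dir() if entry.endswith("/") else path.is_file()
        if not found:
            missing.append(entry)
    return missing


# Build a skeleton without tests/ and check it
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "demo-py-esub"
    for entry in EXPECTED:
        if entry == "tests/":
            continue
        path = root / entry
        if entry.endswith("/"):
            path.mkdir(parents=True, exist_ok=True)
        else:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()
    result = missing_entries(root)

print(result)  # ['tests/']
```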
Consistency
Reproducibility
uv.lock pins dependencies
.python-version specifies Python
Automation
uv sync restores environment
quarto render generates outputs
pytest validates code
Compliance
Core principle: All project assets in version control.
Plain text workflow:
.qmd files for analysis (not .ipynb for final deliverables)
.md files for documentation
.toml files for configuration
.xlsx files for tracking
Project tracking:
Planning:
Development:
analysis/ and src/
Validation:
tests/
(ruff, mypy, pytest)
Delivery:
quarto render
FDA Study Data Technical Conformance Guide Section 4.1.2.10:
Submit programs for primary and secondary efficacy analyses. Specify software in ADRG. Use ASCII text format. No executable extensions.
Goal: Enable reviewers to understand and confirm analysis algorithms.
Analysis package: https://github.com/elong0527/demo-py-esub
Submission package: https://github.com/elong0527/demo-py-ectd
Clone and explore to see complete examples.
m5/datasets/<study-id>/analysis/adam/
├── datasets/
│   ├── *.xpt                          # ADaM datasets
│   ├── define.xml
│   ├── adrg.pdf                       # Instructions
│   └── analysis-results-metadata.pdf
└── programs/
    ├── py0pkgs.txt                    # Packed Python package
    ├── tlf-01-*.txt                   # Analysis programs
    └── tlf-02-*.txt
Key: All files in programs/ must be ASCII text.
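The ASCII requirement is easy to verify before packaging. A minimal sketch using only the standard library follows; the helper name and the example file names are hypothetical.

```python
import tempfile
from pathlib import Path


def non_ascii_files(folder: Path) -> list[str]:
    """Return names of files in folder that contain non-ASCII bytes."""
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and not p.read_bytes().isascii()
    )


# Demonstrate on a temporary programs/ folder with one offending file
with tempfile.TemporaryDirectory() as tmp:
    programs = Path(tmp) / "programs"
    programs.mkdir()
    (programs / "tlf-01-example.txt").write_text("print('ok')\n", encoding="utf-8")
    (programs / "tlf-02-example.txt").write_text("# caf\u00e9\n", encoding="utf-8")
    offenders = non_ascii_files(programs)

print(offenders)  # ['tlf-02-example.txt']
```

Typical offenders are curly quotes or accented characters pasted into comments, so a check like this is worth running on every file in programs/.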
Packs Python projects into portable text files.
Why needed:
pkglite capabilities:
.txt file
Documentation: https://pharmaverse.github.io/py-pkglite/
1. Create .pkgliteignore
2. Pack the package
3. Convert Quarto to Python scripts
.qmd -> verify it works
.qmd -> .ipynb -> .py
ruff
.txt (no .py extension)
Human-readable Debian Control File (DCF) format:
# Generated by py-pkglite
# Use `pkglite unpack` to restore
Package: demo-py-esub
File: pyproject.toml
Format: text
Content:
[project]
name = "demo001"
version = "0.1.0"
...
Reviewers can read without special tools.
Document the Python environment:
Python environment:
| Software | Version | Description |
|---|---|---|
| Python | 3.13.9 | Programming language |
| uv | 0.9.7 | Package manager |
Packages:
| Package | Version | Description |
|---|---|---|
| polars | 1.35.1 | Data manipulation |
| rtflite | 1.0.2 | RTF generation |
| demo001 | 0.1.0 | Study functions |
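Version tables like these can be generated from the active environment rather than typed by hand. The sketch below uses the standard-library `importlib.metadata`; the helper names are hypothetical, and packages that are not installed are reported as such instead of raising an error.

```python
from importlib import metadata


def installed_version(name: str) -> str:
    """Look up the installed version of a distribution, if any."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def markdown_table(rows: list[tuple[str, str, str]]) -> str:
    """Render (package, version, description) rows as a Markdown table."""
    lines = ["| Package | Version | Description |", "|---|---|---|"]
    lines += [f"| {name} | {version} | {desc} |" for name, version, desc in rows]
    return "\n".join(lines)


# Descriptions are copied from the table above
packages = {"polars": "Data manipulation", "rtflite": "RTF generation"}
table = markdown_table(
    [(name, installed_version(name), desc) for name, desc in packages.items()]
)
print(table)
```

Generating the table from the locked environment keeps the ADRG consistent with uv.lock.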
Appendix: Step-by-step reproduction instructions.
Essential: Simulate reviewer experience before submission.
Workflow:
uvx pkglite unpack programs/py0pkgs.txt -o .
cd demo-py-esub && uv sync
python ../programs/tlf-*.txt
Catches: Missing dependencies, path errors, platform issues.
Book:
Regulatory:
Technical: